Double-Layer Detection Model of Malicious PDF Documents Based on Entropy Method with Multiple Features

Traditional PDF document detection technology usually builds a rule or feature library for specific vulnerabilities and therefore is only fit for single detection targets and lacks anti-detection ability. To address these shortcomings, we build a double-layer detection model for malicious PDF documents based on an entropy method with multiple features. First, we address the single detection target problem with the fusion of 222 multiple features, including 130 basic features (such as objects, structure, content stream, metadata, etc.) and 82 dangerous features (such as suspicious and encoding function, etc.), which can effectively resist obfuscation and encryption. Second, we generate the best set of features (a total of 153) by creatively applying an entropy method based on RReliefF and MIC (EMBORAM) to PDF samples with 37 typical document vulnerabilities, which can effectively resist anti-detection methods, such as filling data and imitation attacks. Finally, we build a double-layer processing framework to detect samples efficiently through the AdaBoost-optimized random forest algorithm and the robustness-optimized support vector machine algorithm. Compared to the traditional static detection method, this model performs better for various evaluation criteria. The average time of document detection is 1.3 ms, while the accuracy rate reaches 95.9%.


Introduction
In recent years, the number of network attacks through malicious documents has increased dramatically. Such attacks are often accompanied by serious harm, such as phishing, organization monitoring, and denial of service attacks. In network attacks based on malicious documents, the PDF document type accounts for a large proportion. According to the statistics of F-Secure Security, in 2020, malicious document attacks related to Adobe Reader accounted for 60% of total document attacks. Attackers can use embedded scripts, remote links and other means to carry out attacks through the plentiful functions of Adobe related products. The detection of such concealed attacks is difficult [1].
Among PDF-related attacks, constructing document vulnerabilities by exploiting defects of Adobe software is extremely harmful [2][3][4][5]. Through exploiting the vulnerabilities of document readers or parsers, such attacks can cause various types of harm, including downloading malicious programs remotely, implementing backdoor implantation, and executing malicious code directly. In 2020, Adobe released a security update bulletin to disclose the vulnerability CVE-2020-24432, the principle of which is that Adobe Acrobat Reader lacks strictness while censoring input validation. Attackers can execute arbitrary code in the context of the current user and cause serious damage.
Since 2018, Adobe Acrobat Reader has released more than 30 security update bulletins and disclosed more than 200 CVE vulnerabilities. Among them, there are more than 50 document vulnerabilities which are destructive and widely disseminated. Table 1 presents the typical document vulnerability information disclosed by Adobe in recent years.
In recent years, researchers have presented various methods for detecting malicious documents. The detection types are mainly divided into two categories: static and dynamic detection methods. The static method usually determines the document's nature through analyzing the content, basic attributes, basic structure, metadata, and other document features without running them [6][7][8][9], whereas the dynamic method usually achieves detection through analyzing the system calls, operation behavior, and other features in a virtual environment's running process [10][11][12]. Traditional static and dynamic detection methods have advantages and limitations. Static detection does not need to execute actual samples and is thus relatively secure with high detection efficiency, fast speed, and low cost, but it ignores the malicious code extraction from documents [13]. Dynamic detection does not need to learn samples and can intuitively find the purpose of attack via running behavior, leading to a strong robustness. However, it faces such challenges as low efficiency, low speed, enormous cost, and sometimes threats from the anti-virtual machine and anti-sandbox technology [14][15][16].
Based on the above analysis, we propose a new double-layer detection model for malicious PDF documents based on an entropy method with multiple features. First, we integrate a total of 222 basic and dangerous features, such as PDF document objects, physical logical structure, content stream, metadata, JavaScript code, dangerous functions and bytes, etc. Second, we use an entropy method based on RReliefF and MIC to filter redundant features and select 153 features as the best feature set. Third, we train the double-layer model with the optimized random forest and the support vector machine classification model through the training set of 4121 malicious samples and 4721 benign samples. Finally, we design a comparative experiment and verify the reliability of the model through the test set of 1374 malicious samples and 1574 benign samples. The result shows our model has an accuracy rate of 95.9% and is better than the contrast model.
The main contributions of this paper are as follows: (1) We propose a multi-feature fusion extraction method. It not only extracts basic features such as PDF document objects, physical structure, and content stream but also common dangerous features based on frequency induction, which solves the problem of a single detection target faced by conventional detection methods and effectively resists obfuscation and encryption. (2) We present a feature-selection-filtering module: EMBORAM. Combined with the weights of features calculated by RReliefF and correlation of features and tags calculated by MIC, the best feature set generated by the entropy method can resist anti-detection methods such as data filling and imitation attacks. (3) We construct a double-layer processing framework through the AdaBoost-optimized random forest and the robustness-optimized support vector machine classification model. The deficiencies of models in PDF document detection are improved and the detection effect is enhanced through optimization and combination.  (4) We collect a large amount of data with a comprehensive coverage of the training samples. Experiments verify that the detection accuracy rate is high at low time consumption. As the result shows, this model can efficiently detect malicious PDFs adopting common attack methods.
The rest of this paper is organized as follows. Section 2 introduces background and related work. Section 3 discusses in detail the double-layer detection model and the specific implementation of each module. Section 4 introduces the comparison experiments and analysis. Section 5 summarizes the work and gives an outlook.

Background and Related Work
This module mainly introduces the background of PDF and analyzes the realization and shortcomings of traditional methods.

PDF Background
The newest format of PDF was published as ISO 32000-1:2020 [17]. According to the standard, the basic structure of PDF documents is mainly divided into four parts: objects, physical structure, logical structure, and content stream [18,19], as shown in Figure 1. optimized random forest and the robustness-optimized support vector machine classification model. The deficiencies of models in PDF document detection are improved and the detection effect is enhanced through optimization and combination. (4) We collect a large amount of data with a comprehensive coverage of the training samples. Experiments verify that the detection accuracy rate is high at low time consumption. As the result shows, this model can efficiently detect malicious PDFs adopting common attack methods.
The rest of this paper is organized as follows. Section 2 introduces background and related work. Section 3 discusses in detail the double-layer detection model and the specific implementation of each module. Section 4 introduces the comparison experiments and analysis. Section 5 summarizes the work and gives an outlook.

Background and Related Work
This module mainly introduces the background of PDF and analyzes the realization and shortcomings of traditional methods.

PDF Background
The newest format of PDF was published as ISO 32000-1:2020 [17]. According to the standard, the basic structure of PDF documents is mainly divided into four parts: objects, physical structure, logical structure, and content stream [18][19], as shown in Figure 1. The details of each part are as follows. (a) Objects. As the main part of PDF documents, objects carry various content, such as text information, fonts, embedded pictures, embedded videos, hyperlinks, and bookmarks. But the basic structure of different objects is similar regardless of the content classification. As shown in Figure 1, the first line in objects is the identifier, which consists of two numbers. The first is the serial number of objects. The second is the generation number of objects and is used to indicate whether the object has been modified. (b) Physical structure. This is mainly composed of four parts: file header, file body, cross-reference table, and file tail. The file header with a simple and fixed format is used to indicate the PDF version. The file body composed of document objects is the core part of the PDF. The cross-reference table is used to index document objects. The file tail is mainly used to save the summary, location, and other related information of the cross-reference table. (c) Logical structure. In the actual parsing process, PDFs are not parsed through physical structure but through logical structure. Parsing begins with the root node indicated by the file tail. The node indicates the directory, which contains pages, outlines, and other types of information. Each type of information is also The details of each part are as follows. (a) Objects. As the main part of PDF documents, objects carry various content, such as text information, fonts, embedded pictures, embedded videos, hyperlinks, and bookmarks. But the basic structure of different objects is similar regardless of the content classification. As shown in Figure 1, the first line in objects is the identifier, which consists of two numbers. The first is the serial number of objects. The second is the generation number of objects and is used to indicate whether the object has been modified. (b) Physical structure. This is mainly composed of four parts: file header, file body, cross-reference table, and file tail. The file header with a simple and fixed format is used to indicate the PDF version. The file body composed of document objects is the core part of the PDF. The cross-reference table is used to index document objects. The file tail is mainly used to save the summary, location, and other related information of the cross-reference table. (c) Logical structure. In the actual parsing process, PDFs are not parsed through physical structure but through logical structure. Parsing begins with the root node indicated by the file tail. The node indicates the directory, which contains pages, outlines, and other types of information. Each type of information is also organized in a tree structure. (d) Content stream. This is a common form of objects in PDFs and plays a key role in storing data. Stream objects are composed of three parts. The first part is a dictionary, which mainly stores the length and encoding method. The second part is the keyword, which is unified in different stream objects. It usually starts with "stream" and ends with "endstream". The third part is the data between keywords.

Static Detection Method
Currently, the common static detection methods are mainly divided into three categories. The first category tends to detect the content features of files, mainly to extract suspicious JavaScript code fragments, shellcode data fragments, and metadata content in PDF documents. According to Tzermias et al. [11], more than 90% of malicious PDF document attacks need to be implemented with JavaScript and other codes. The detection model proposed by Laskov et al. [20] extracts JavaScript and uses lexical tagging to build an OCSVM classification. However, such methods are insufficient for PDF documents that do not rely on JavaScript code. Some document vulnerabilities build attack chains with the help of the document format. The second category tends to detect the structural features of documents. It achieves detection mainly through extracting document structure and combining features such as metadata. Šrndić et al. [21] extracted vast features of basic structure for PDF but also had limitations in extracting malicious features. Cohen et al. [7] adopted the SFEM method to extract features from the document structure. Chandran et al. [22] scanned the structure of the PDFs through PeePDF and used the GRU model to employ classification. Srndic et al. [23] processed metadata in a similar way to structural paths and then substituted the data into classification models to achieve detection. But such methods have limited abstraction of features and are not comprehensive enough to detect content features. The static methods above are not comprehensive in feature extraction, resulting in insufficient detection for various attack methods. The third category tends to build the feature library and use multiple features. Wen Weiping et al. [24] designed a feature library for document vulnerabilities. The malicious document can be identified when it matches the relevant features of the feature library. However, such methods only apply to malicious PDF documents with disclosed vulnerabilities and have no detection effect on 0-day vulnerabilities. Falah et al. [25] used feature engineering to evaluate multiple features and detect malicious PDFs, but they ignored malicious features in JavaScript code, and the feature evaluation method deserves improvement.

Dynamic Detection Method
Dynamic detection methods mainly focus on JavaScript code and shellcode data fragments embedded in the document. The MDScan method [14] mainly executes the extracted JavaScript code. It extracts the relevant operation performance of memory as a sequence and performs subsequent detection. But these similar matching methods have limitation in detecting new type of attacks. Iwamoto et al. [26] used the simulation method to execute the document shellcode. It is mainly based on the entropy of the byte sequence, which can solve the problem of difficult vulnerability triggering to some extent. However, this detection is insufficient for some malicious codes that can be only triggered in a specific situation. Xu et al. [27] proposed opening the PDF document with the same reader in the heterogeneous operating system and identified malicious documents through the similarity performance of system calls and process tracking. However, such methods have an excessive overhead and low detection efficiency. Liu et al. [28] executed JavaScript code in PDF documents through their own built-in execution environment and monitored common malicious behaviors. But such methods can only detect traditional and common attack methods. It is insufficient to detect the document using new anti-detection method. In summary, the dynamic detection method is expensive and consumes large amounts of resources and memory space. It is not suitable for situations with large numbers of samples, short time requirements, and low resource requirements [29].

Double-Layer Detection Model Based on Entropy Method with Multiple Features
In order to make up for the shortcomings of traditional detection methods, this paper strives to improve from the following perspectives: (a) comprehensive detection of malicious PDFs, which can achieve effective detection of a variety of vulnerability attacks; (b) effective defense against common anti-detection methods, such as data stuffing, mimicry attacks, etc.; (c) fast detection, low cost, and low consumption. Therefore, we design a double-layer detection model based on an entropy method with multiple features. This section focuses on the specific introduction of the model, including model outline, model framework, theoretical support, and specific implementation methods.

Overview of the Model
Based on the analysis of benign and malicious samples, it can be concluded that there are some differences between two types of samples in terms of basic and dangerous features. Therefore, we can effectively detect malicious documents through constructing the training model via supervised learning. The model framework of this paper is shown in Figure 2.

Double-Layer Detection Model Based on Entropy Method with Multiple Features
In order to make up for the shortcomings of traditional detection methods, this paper strives to improve from the following perspectives: (a) comprehensive detection of malicious PDFs, which can achieve effective detection of a variety of vulnerability attacks; (b) effective defense against common anti-detection methods, such as data stuffing, mimicry attacks, etc.; (c) fast detection, low cost, and low consumption. Therefore, we design a double-layer detection model based on an entropy method with multiple features. This section focuses on the specific introduction of the model, including model outline, model framework, theoretical support, and specific implementation methods.

Overview of the Model
Based on the analysis of benign and malicious samples, it can be concluded that there are some differences between two types of samples in terms of basic and dangerous features. Therefore, we can effectively detect malicious documents through constructing the training model via supervised learning. The model framework of this paper is shown in Figure 2. As the figure shows, we first extract and fuse the basic and dangerous features of the document. Then, we select and filter features through EMBORAM. Finally, we train the feature set in the optimized classification model. The detection model is mainly divided into four modules: the basic feature extraction module, dangerous feature extraction module, EMBORAM module, and optimization training module. The first two modules implement extraction and fusion of multiple features, the third module implements selection and filtering of features, and the fourth module implements optimization, training, and detection.

Basic Feature Extraction Module
Basic features of PDF are mainly divided into four parts: objects, physical structure, logical structure, and content stream. To extract PDF document features more comprehensively, this module also extracts metadata of PDFs.
The basic feature extraction work references four mainstream PDF document analysis tools, including PeePDF [30], PDFParser [31], PDFTear [32], and PDFRate [33]. The above tools can extract PDF features and analyze PDF maliciousness but have certain limitations. Therefore, this paper combines four mainstream tools with designed regular expressions to achieve a comprehensive extraction and fusion of PDF basic features. As the figure shows, we first extract and fuse the basic and dangerous features of the document. Then, we select and filter features through EMBORAM. Finally, we train the feature set in the optimized classification model. The detection model is mainly divided into four modules: the basic feature extraction module, dangerous feature extraction module, EMBORAM module, and optimization training module. The first two modules implement extraction and fusion of multiple features, the third module implements selection and filtering of features, and the fourth module implements optimization, training, and detection.

Basic Feature Extraction Module
Basic features of PDF are mainly divided into four parts: objects, physical structure, logical structure, and content stream. To extract PDF document features more comprehensively, this module also extracts metadata of PDFs.
The basic feature extraction work references four mainstream PDF document analysis tools, including PeePDF [30], PDFParser [31], PDFTear [32], and PDFRate [33]. The above tools can extract PDF features and analyze PDF maliciousness but have certain limitations. Therefore, this paper combines four mainstream tools with designed regular expressions to achieve a comprehensive extraction and fusion of PDF basic features.
The comparison between this paper and four mainstream analysis tools is shown in Table 2. As the table shows, we can ensure the comprehensiveness of the feature set and improve the detection accuracy. First, we use the above tools to extract features from PDF. Then, we use regular expressions to match different feature symbols. After that, we can extract the specific feature content and its corresponding information, such as quantity, location, etc. The symbols matched by regular expressions in this paper are shown in Appendix A.
Therefore, with the help of four mainstream PDF document analysis tools and the designed regular expressions, this module extracts a total of 130 basic features of PDF documents. Detailed information on basic features extracted in this paper is shown in Appendix B.

Dangerous Feature Extraction Module
Dangerous feature extraction is mainly for PDF documents with JavaScript attacks. After analyzing a number of document vulnerabilities, we found that malicious PDF generally achieve attack by PDF reader vulnerabilities. Among them, buffer overflow attacks account for the majority. In order to achieve command execution after the buffer overflow, the attack usually needs to work with JavaScript. Therefore, the dangerous feature extraction mainly focuses on JavaScript features.
Fernandez et al. [34] collected and classify samples of malicious code of web pages, which are divided into different categories. Each category contains a type of malicious JavaScript code and its variations. The code in malicious PDF is similar to that in malicious web pages. But JavaScript in PDFs has some characteristics of its own. Therefore, we need to add some malicious features for PDFs to create a comprehensive extraction. We take the CVE-2020-24432 sample as an example. The vulnerability achieves arbitrary command execution through the input validation defect of Adobe.
The attack is exploited through overflow vulnerability. It hijacks the return address to the filling data built by JavaScript and finally executes malicious JavaScript code. We can find that the malicious code often creates obfuscation, encoding, and decoding operations. By highly obfuscating JavaScript code, it can effectively counteract the detection of malicious JavaScript in PDFs. De-obfuscating such code is also difficult without running it. However, we can extract dangerous features of JavaScript as training data and achieve detection through multidimensional feature classification rather than JavaScript code analysis. Therefore, we must summarize common malicious JavaScript operations and use them as our dangerous feature set. We analyze 10 types of typical malicious PDF files that attack through JavaScript code, as shown in Table 3.  We use the basic feature extraction method to extract the JavaScript objects of the above malicious PDF files. To extract and summarize the dangerous features, we need to perform the corresponding word separation operation in the JavaScript code and count the frequency. We use the special character separation method. The special characters for separation are: ","; "("; ")"; "%"; ";"; "+"; "/"; "'"; "="; "<"; ">"; " ". We divide characters into four categories according to the feature type: (1)  For the above PDF files, we analyzed word frequency, and the final results are shown in Figure 3. We use the basic feature extraction method to extract the JavaScript objects of the above malicious PDF files. To extract and summarize the dangerous features, we need to perform the corresponding word separation operation in the JavaScript code and count the frequency. We use the special character separation method. The special characters for separation are: ","; "("; ")"; "%"; ";"; "+"; "/"; "'"; "="; "<"; ">"; " ". We divide characters into four categories according to the feature type: (1)  For the above PDF files, we analyzed word frequency, and the final results are shown in Figure 3. As shown in the figure, the number of characters extracted in JavaScript is 82, and the number of characters in each type are 26, 6, 23, and 27, respectively. In addition, we add some macro features, e.g., the proportion of specific characters. The features summarized in this paper are shown in Appendix C.

EMBORAM Module
In the previous section, we implemented multi-feature fusion extraction and extracted a total of 222 relevant features from PDF documents. But the number of features is not necessarily proportional to the detection accuracy. Some of them are redundant features, which have a negative impact on the detection accuracy. Therefore, we need to perform feature selection and filtering to eliminate redundant features and improve the detection accuracy of the model. In this module, we creatively raise the entropy method based on RReliefF and MIC (EMBORAM). According to the weights of features calculated using RReliefF and correlation of features and tags calculated using MIC, we use an entropy method to score each feature and set a filter criterion to greedily select the best feature set. Because the attacker does not know the details of the set, the filler or imitation data constructed by the attacker have a certain probability of being filtered by this module. As shown in the figure, the number of characters extracted in JavaScript is 82, and the number of characters in each type are 26, 6, 23, and 27, respectively. In addition, we add some macro features, e.g., the proportion of specific characters. The features summarized in this paper are shown in Appendix C.

EMBORAM Module
In the previous section, we implemented multi-feature fusion extraction and extracted a total of 222 relevant features from PDF documents. But the number of features is not necessarily proportional to the detection accuracy. Some of them are redundant features, which have a negative impact on the detection accuracy. Therefore, we need to perform feature selection and filtering to eliminate redundant features and improve the detection accuracy of the model. In this module, we creatively raise the entropy method based on RReliefF and MIC (EMBORAM). According to the weights of features calculated using RReliefF and correlation of features and tags calculated using MIC, we use an entropy method to score each feature and set a filter criterion to greedily select the best feature set. Because the attacker does not know the details of the set, the filler or imitation data constructed by the attacker have a certain probability of being filtered by this module. Therefore, we can not only improve the accuracy of detection but also effectively counter attacker's anti-detection methods. The frame of the module is shown in Figure 4. Therefore, we can not only improve the accuracy of detection but also effectively counter attacker's anti-detection methods. The frame of the module is shown in Figure 4. We set the initial number of PDF features as ( = 222). The RReliefF algorithm can discover the strong dependencies between attributes and has strong robustness. It also can be used for classification tasks.
Then, we calculate the information entropy of and according to the normalization result, as in the following equation.
Therefore, the weights of and can be separately calculated, as in the following equation.
Then, we build a filter criterion as in Equation (4), which represents the weighted average for the relevance of the new feature set to the category. represents the selected feature set which contains features . represents the feature to be selected. and balance the importance of RReliefF and MIC. * ∑ ( ) + ( ) Finally, a forward search strategy is adopted to greedily select the features that maximize the criterion until the number of features reaches the selection proportion threshold (after the final experiment. The best result is achieved when = 0.69 while the number of feature set is 153). We use the finalized feature set as our selected features for model training and thus minimize the interference of redundant data. The entropy method is excellent and objective in this assignment field. It is based on the degree of variation of indicators. First, we normalize r and q as calculated by RReliefF and MIC, as in the following equation.

Optimization Training Module
Then, we calculate the information entropy of r and q according to the normalization result, as in the following equation.
Therefore, the weights of r and q can be separately calculated, as in the following equation.
Then, we build a filter criterion as in Equation (4), which represents the weighted average for the relevance of the new feature set to the category. S represents the selected feature set which contains features s. f represents the feature to be selected. η and θ balance the importance of RReliefF and MIC.
Finally, a forward search strategy is adopted to greedily select the features that maximize the criterion until the number of features reaches the selection proportion threshold K (after the final experiment. The best result is achieved when K = 0.69 while the number of feature set is 153). We use the finalized feature set as our selected features for model training and thus minimize the interference of redundant data.

Optimization Training Module
In this paper, we use the random forest and the support vector machine classification models. Both of them are suitable for processing high-dimensional data, and thus-in the case of PDF documents with a large number of features-the model can be trained with less computational overhead and greater robustness. But both classification models have certain drawbacks for PDF document detection. Therefore, we implement some optimization and design a double-layer detection model. In this way, we can make up the shortcomings of both models, which are unstable and susceptible to noise, to some extent.

AdaBoost-Optimized Random Forest Classification Model
The original random forest classification model has some limitations. The classifier will be over-fitted if there is noise in the features. Although we use EMBORAM to select the best features in the early section, there may still be noise in the feature set. Therefore, we create AdaBoost-optimized random forest classification to improve the overall training and classification effect.
AdaBoost is essentially an adaptive boosting algorithm. It can adjust the weights of each classification according to the accuracy. Random forest collects samples randomly to form subsets. Each subset builds a decision tree after training. Finally, it uses decision trees to build the random forest and conducts classification by equally voting. To improve the effect, we use the AdaBoost algorithm to adjust voting weights of decision trees in random forest. Decision trees with stronger classification ability have higher voting weights. We obtain the final results through the maximum voting principle. The optimized method is as shown in Figure 5. In this paper, we use the random forest and the support vector machine classification models. Both of them are suitable for processing high-dimensional data, and thus-in the case of PDF documents with a large number of features-the model can be trained with less computational overhead and greater robustness. But both classification models have certain drawbacks for PDF document detection. Therefore, we implement some optimization and design a double-layer detection model. In this way, we can make up the shortcomings of both models, which are unstable and susceptible to noise, to some extent.

AdaBoost-Optimized Random Forest Classification Model
The original random forest classification model has some limitations. The classifier will be over-fitted if there is noise in the features. Although we use EMBORAM to select the best features in the early section, there may still be noise in the feature set. Therefore, we create AdaBoost-optimized random forest classification to improve the overall training and classification effect.
AdaBoost is essentially an adaptive boosting algorithm. It can adjust the weights of each classification according to the accuracy. Random forest collects samples randomly to form subsets. Each subset builds a decision tree after training. Finally, it uses decision trees to build the random forest and conducts classification by equally voting. To improve the effect, we use the AdaBoost algorithm to adjust voting weights of decision trees in random forest. Decision trees with stronger classification ability have higher voting weights. We obtain the final results through the maximum voting principle. The optimized method is as shown in Figure 5. (1) We use the dataset and features to build the initial decision trees ℎ ( ) in the random forest. (2) The classification weight of the decision trees for each category is calculated based on the classification ability. We initialize training sample weights, which means the weight of decision tree for sample , as in the following equation.
For the decision trees = 1: , we loop from (a) to (c).
(a) We calculate the error rate of the decision tree ℎ ( ) to the category , as in the following equation. We note the extracted features for each PDF sample as F and note the number of decision trees as T. The dataset is {(x 1 , y 1 ), (x 2 , y 2 ), . . . , (x n , y n )}, where x i ∈ F, y i ∈ Y = {ben, mal}. We note the initial classifier as H, H = {h 1 , h 2 . . . h T }. The optimization steps are as follows.
(1) We use the dataset and features F to build the initial decision trees h t (x) in the random forest. (2) The classification weight of the decision trees for each category is calculated based on the classification ability. We initialize training sample weights, which means the weight of decision tree t for sample i, as in the following equation.
For the decision trees t = 1 : T, we loop from (a) to (c).
(a) We calculate the error rate ε C t of the decision tree h t (x) to the category C, as in the following equation.
(b) If ε C t > 1/2, the decision tree will be dropped and this cycle will be finished. Because AdaBoost is designed to binary classification algorithm, which requires the error rate to be less than 1/2 (random guessing probability).
(c) If ε C t < 1/2, we calculate the weight of the decision tree h t (x) for the category C, as in the following equation.
(d) We can update the weight of decision tree h i in the random forest, as in the following equation.
Z i is the normalization factor, as in the following equation.
The α C t is the weight of decision trees for the category C. (3) We sequentially set C = ben, mal and repeat step (2). The final voting weights of the decision trees for two categories can be obtained.

Robust Optimized Support Vector Machine Classification Model
The support vector machine has the advantage of high-dimensional feature-handling and meets the need for PDF document classification. However, it is more suitable for situations with small sample sizes. For situations such as malicious PDF detection that require a large number of samples for training, there is a problem of unstable classification. Therefore, we perform robustness optimization of the support vector machine. Christmann, A. et al. [35] proposed an HL-SVM model. They imported the regularization parameter and constrained some unstable samples in the training data.
We note the extracted features for each PDF sample as F and note the training set of all PDF documents as T, T = {(x 1 , y 1 ), (x 2 , y 2 ), . . . , (x n , y n )}, where x i ∈ F, y i ∈ Y = {ben, mal}. The optimization equation for the HL-SVM model is as follows.
Among them, a T x i + b = 0 is the classification hyperplane of the support vector machine; λ > 0 is the regularization parameter; the function [ ] + = max 0, 1 − y i a T x i + b constrains the samples which affect the stability and accuracy of the hyperplane. But such a method will cause great changes in the hyperplane and reduce the accuracy for future samples, as shown in Figure 6b.
We set a boundary range by moving the hyperplane. θ i is used to judge whether the sample is in the boundary set. If the sample meets the boundary range, then θ i = 0, and it is rejected from participating in the learning of the support vector machine model, as shown in Figure 6c.
As can be seen in the figure, the improved support vector machine incorporates unstable samples into the learning range, resulting in limitation in classification stability. The HL-SVM model causes great changes in the hyperplane. In contrast, the robustnessoptimized support vector machine rejects the boundary samples from participating in model learning so that the overall stability of the model is improved. Therefore, the choice of the boundary sample selection will have an impact on model training. We note the width of boundary range as B. After the final experiment, the best result is achieved when B = 0.13, while the number of boundary set is 271.

Double-Layer Processing Framework
We use the AdaBoost-optimized random forest model and the robustness-optimized support vector machine model as classifiers. To obtain the final detection model, we need to combine the two classifiers. We build a double-layer processing framework. We first use the random forest model and then use the support vector machine model for detection. We note the result of random forest model detection as result 1 , the result of support vector machine model detection as result 2 , and the sample classification set as {ben, mal}. The final detection result is as follows.
When both detection models detect the sample as benign, the file detection result is benign. Through double-layer processing, the problem of false negatives can be solved best and the overall detection ability of the model can be improved.

Dataset
This section focuses on the sample sources used for the experiments and tests. The benign samples were collected mainly by Google [36], Yahoo Search Engine [37], and the Github react-pdf project [38]. To enhance the robustness of the experiments, we collected benign samples extensively, including various types of PDF samples with advanced features. All benign samples were scanned and checked using VirusTotal. The malicious samples were downloaded through VirusTotal. The set contains samples of malicious PDF documents from more than 100 exploits during the last 10 years. We selected a total of 5495 malicious PDF samples with 37 typical vulnerabilities from the set. Among the malicious samples, there are 1074 samples with high obfuscation and encryption (malicioushigh samples). All the samples were detected as malicious through VirusTotal and sandbox. The basic attributes of the samples are shown in Table 4. According to the statistics of the basic attributes, it can be seen that more than 60% of the malicious samples are smaller than 100 KB. But most of the benign PDFs crawled through search engines are larger than 100 KB in size. Therefore, we used the react-pdf project to create some PDF files smaller than 100 KB in size. In this way, the robustness of the model can be improved because the features of these samples fit the malicious documents.

Environment
The experiment is conducted using an NVIDIA RTX3090 GPU, connected by an Intel (R) Core (TM) i7-1165G7 @ 2.80 GHz. The software is Python3.7.1. In terms of settings for random forest model, the max_features is 0.7, n_estimators is 70, and max_depth is 8. In terms of settings for the support vector machine, the kernel function is rbf. In terms of dataset, we select 75% of benign and malicious PDF samples as the training set (4721 benign samples and 4121 malicious samples) and 25% as the test set (1574 benign samples and 1374 malicious samples) via the random sampling method.

Procedure
We first investigate the feature selection proportion threshold K and the boundary sample selection threshold B to obtain the best threshold value. Then, we substitute the value into the subsequent comparison experiments. After the experimental process, the accuracy and performance of the model are evaluated based on the records.

The Impact of Feature Selection Proportion Threshold K
In the EMBORAM module, we utilize an entropy method based on RReliefF and MIC. In the final stage of filtering, we greedily select the features which maximize the filter criterion until the number reaches the setting proportion. The setting of the proportion threshold K affects the number of features in the final feature set. Therefore, the threshold K also has some influence on the accuracy of the model. In our experiments, we conducted traversal experiments on the threshold K. The impact effect is shown in Figure 7. According to the statistics of the basic attributes, it can be seen that more than 60% of the malicious samples are smaller than 100 KB. But most of the benign PDFs crawled through search engines are larger than 100 KB in size. Therefore, we used the react-pdf project to create some PDF files smaller than 100 KB in size. In this way, the robustness of the model can be improved because the features of these samples fit the malicious documents.

Environment
The experiment is conducted using an NVIDIA RTX3090 GPU, connected by an Intel (R) Core (TM) i7-1165G7 @ 2.80 GHz. The software is Python3.7.1. In terms of settings for random forest model, the max_features is 0.7, n_estimators is 70, and max_depth is 8. In terms of settings for the support vector machine, the kernel function is rbf. In terms of dataset, we select 75% of benign and malicious PDF samples as the training set (4721 benign samples and 4121 malicious samples) and 25% as the test set (1574 benign samples and 1374 malicious samples) via the random sampling method.

Procedure
We first investigate the feature selection proportion threshold and the boundary sample selection threshold to obtain the best threshold value. Then, we substitute the value into the subsequent comparison experiments. After the experimental process, the accuracy and performance of the model are evaluated based on the records.

The Impact of Feature Selection Proportion Threshold
In the EMBORAM module, we utilize an entropy method based on RReliefF and MIC. In the final stage of filtering, we greedily select the features which maximize the filter criterion until the number reaches the setting proportion. The setting of the proportion threshold affects the number of features in the final feature set. Therefore, the threshold K also has some influence on the accuracy of the model. In our experiments, we conducted traversal experiments on the threshold K. The impact effect is shown in Figure 7. According to the figure, we know that when the threshold is set too low, the number of the final feature set is high. Redundant features will participate in the training, According to the figure, we know that when the threshold K is set too low, the number of the final feature set is high. Redundant features will participate in the training, which will have a negative impact on the model accuracy. In the interval of 0.67 ≤ K ≤ 0.77, the model accuracy has a certain volatility. When K > 0.77, the number of the feature set is small. As a result, the number of the feature set is too low, leading to a negative impact on the model. When K = 0.69 and the number of feature sets is 153, the model accuracy reaches the highest. At this time, we can filter redundant features and retain the best features to achieve optimal learning and training effect for the model.

Filtered Features
In the EMBORAM module, we filter the best dataset with 153 features by entropy method. The final number of features, filtered number of features, and filtered proportion in each category is shown in Table 5. According to the result, the features in the objects category of basic features and the basic attribute category of dangerous features are mostly filtered. This reflects that differences in some objects of PDF samples are tiny and that the metric built using RreliefF and MIC is low, which finally leads to a lower score via the entropy method. The minimum filtered proportion is metadata features, which have the most obvious differences in PDF samples and are commonly used in relevant research of malicious PDF detection. Each category in our paper has its own selected features after being filtered, which shows the features extracted in the model are multiple and comprehensive.

The Impact of the Boundary Sample Selection Threshold B
In the training phase of the robustness-optimized support vector machine model, we define the boundary sample set. The robustness of the model is improved because the unstable samples that fall into the boundary sample set do not participate in the model learning of the support vector machine. Therefore, the method of dividing the boundary sample set affects the model training to a certain extent. According to the description in the previous section, we divide the boundary sample set by translating the classification hyperplane up and down, so the boundary sample set threshold B has some influence on model learning. We conduct experiments on threshold B, and the effect is shown in Figure 8. According to the figure, we know that when the boundary sample selection threshold is set too low, the number of samples in the boundary set is small. Almost all samples can participate in the support vector machine model learning, resulting in the negative effect on the stability of the model to some extent. When the threshold value grows, the number of samples in the boundary set is moderate. At this time, the vast majority of boundary samples can be removed from the training set, leading to the positive effect on the stability According to the figure, we know that when the boundary sample selection threshold is set too low, the number of samples in the boundary set is small. Almost all samples can participate in the support vector machine model learning, resulting in the negative effect on the stability of the model to some extent. When the threshold value grows, the number of samples in the boundary set is moderate. At this time, the vast majority of boundary samples can be removed from the training set, leading to the positive effect on the stability of the model. When the threshold value is set too high, most of the normal samples are divided into the boundary set and the actual number of training samples is small, causing a significant decrease in the accuracy rate. When B = 0.13 and the number of samples in the boundary set is 271, the model accuracy reaches its highest.

Evaluation Indicators
In this section, we design comparative experiments with other static detection models and conduct analysis. We mainly use Precision, Recall, and F1 values as evaluation criteria. The Precision is the probability that the alert is actually malicious among all the samples. The Recall is the probability that the alert is actually malicious among all the malicious samples. The F1 value is the equilibrium value that allows both precision and recall to reach the highest value, which is called Fscore.
where TP represents the number of samples which are malicious and successfully identified by the model. FP represents the number of samples which are benign but wrongly identified by the model, i.e., the number of false positive samples. FN represents the number of samples which are malicious but not identified by the model, i.e., the number of missed samples.

Contrast Model
The comparison models in this section are mainly divided into four categories, and each category contains two types of detection models: (1) traditional random forest and support vector machine detection models without EMBORAM, denoted as Tra_RF, Tra_SV M; (2) traditional random forest and support vector machine detection models with EMBO-RAM, denoted as Imp_RF, Imp_SV M; (3) optimized random forest and support vector machine detection models without EMBORAM, denoted as Tra_AdaRF, Tra_RobSV M. (4) related research in PDF detection [23,24], denoted as Smutz, Srndic. The model in this paper is denoted as Paper.

Evaluation of Competency
The evaluation is divided into two main parts. The first part is for all malicious samples. The Fscore of the paper reaches its highest when the threshold is 0.97. The accuracy of the model reaches 0.959, and the recall reaches 0.988. The second part is for highly obfuscation and encryption samples. The Fscore of the paper reaches its highest when the threshold is 0.96. The accuracy of the model reaches 0.906, and the recall reaches 0.928. In two parts, indicators are better than the comparison models, indicating that filtering the selection of features and optimization of the classification models can effectively improve the model's detection effect. The Precision, Recall, and F1 results of the nine detection models are shown in Table 6 and Figure 9. the model's detection effect. The Precision, Recall, and F1 results of the nine detection models are shown in Table 6 and Figure 9. Based on the detection results for all malicious samples, we can find that the _ , _ traditional static detection models based on the feature set extracted by this paper also achieve high accuracy and recall rate. This proves that the extracted features in this paper are comprehensive. The detection effect can be better achieved by extracting both the basic and dangerous features of the document. The accuracy of _ , _ detection models is increased considerably over the traditional detection model. This proves that the selection and filtering of the features Based on the detection results for all malicious samples, we can find that the Tra_RF, Tra_SV M traditional static detection models based on the feature set extracted by this paper also achieve high accuracy and recall rate. This proves that the extracted features in this paper are comprehensive. The detection effect can be better achieved by extracting both the basic and dangerous features of the document.
The accuracy of Imp_RF, Imp_SV M detection models is increased considerably over the traditional detection model. This proves that the selection and filtering of the features using EMBORAM has an obvious positive effect on the classifier training. The reason for this situation is that some features in the vast majority of PDF documents may only exhibit a slight difference. Such features will have a certain degree of impact and interference on the classifier learning, resulting in a negative effect.
The Tra_AdaRF, Tra_RobSV M detection models show a large improvement over traditional common detection model. This proves that optimization of the classifier will improve the detection effect of the model. It should be noted that the Imp_RF, Imp_SV M detection models have a smaller improvement in accuracy rate than the Tra_AdaRF, Tra_RobSV M detection models, indicating that the selection of the feature set has a better improvement effect compared to the optimization of the classifier. This also shows that the capabilities of the two classification models do not show a large difference.
The model in this paper also has an improvement over related research, such as Smutz, Srndic. This reflects that features extracted in this paper are more comprehensive because we realize the fusion of multiple features, especially for the dangerous features in JavaScript.
Based on the detection results for malicious-high samples, we can find that both Precision and Recall are reduced. But the comparison models Smutz and Srndic reduce the most, which indicates multiple features extracted and filtered by the paper can effectively resist obfuscation and encryption.
In order to explore the performance differences between our model and comparison models more intuitively, we count the TP, FP, and FN values of several models for all malicious samples and obtain the results shown in Figure 10.

_
, _ detection models, indicating that the selection of the feature set has a better improvement effect compared to the optimization of the classifier. This also shows that the capabilities of the two classification models do not show a large difference.
The model in this paper also has an improvement over related research, such as , . This reflects that features extracted in this paper are more comprehensive because we realize the fusion of multiple features, especially for the dangerous features in JavaScript.
Based on the detection results for malicious-high samples, we can find that both Precision and Recall are reduced. But the comparison models Smutz and Srndic reduce the most, which indicates multiple features extracted and filtered by the paper can effectively resist obfuscation and encryption.
In order to explore the performance differences between our model and comparison models more intuitively, we count the TP, FP, and FN values of several models for all malicious samples and obtain the results shown in Figure 10. According to the figure, this model has better performance than other compared models in terms of TP, FP, and FN values. We can find that the number of false positives in Category 3 is significantly higher than in Category 2 because it does not select extracted features. Some common features in benign and malicious samples can confuse the model and affect the model training. It is worth noting that if a longitudinal comparison is made between the FN and FP values, it can be found that the number of false positives is higher than the number of missed positives in this model due to the use of a double-layer processing framework. The sample is identified as malicious as long as one of the classification models alarms, so the model can maintain a small number of missed positives better.

Time Overhead
The time overhead mainly contains training time and detection time. In the training phase, we processed a total of 8842 samples. We extracted 1,962,924 features and obtained 1,352,826 features after selection and filtering. The training time mainly contains four stages. The time spent in each training stage is shown in Table 7.  According to the figure, this model has better performance than other compared models in terms of TP, FP, and FN values. We can find that the number of false positives in Category 3 is significantly higher than in Category 2 because it does not select extracted features. Some common features in benign and malicious samples can confuse the model and affect the model training. It is worth noting that if a longitudinal comparison is made between the FN and FP values, it can be found that the number of false positives is higher than the number of missed positives in this model due to the use of a doublelayer processing framework. The sample is identified as malicious as long as one of the classification models alarms, so the model can maintain a small number of missed positives better.

Time Overhead
The time overhead mainly contains training time and detection time. In the training phase, we processed a total of 8842 samples. We extracted 1,962,924 features and obtained 1,352,826 features after selection and filtering. The training time mainly contains four stages. The time spent in each training stage is shown in Table 7. The detection time mainly contains three stages. In this phase, we processed a total of 2948 samples. We extracted 451,044 features and substituted them into the training model for detection. The average time spent is 1.3 ms, and the time spent in each detection stage is shown in Table 8.
For the detection time, the basic feature extraction time is the longest (0.76 ms). This is because the module extracts a large number of features and needs to parse each structure of the PDF document. However, the average detection time of the model is short and the overhead is low, which proves that the model is efficient.

Conclusions
In this paper, we design an efficient detection model for malicious PDFs which does not rely on feature libraries but on an entropy method with multiple features. The model is divided into four modules. In two feature extraction modules, we realize multi-feature fusion and extract a total of 222 basic and dangerous features of PDF documents. Through comprehensive feature set extraction, we can effectively resist obfuscation and encryption. In the EMBORAM module, we use entropy method based on RReliefF and MIC to select and filter the feature set according to the calculated criterion. We summarize the best set of 153 features, which can better resist data padding, imitation attacks, and other antidetection methods. In the optimization training module, we build a double-layer processing framework through the AdaBoost-optimized random forest and the robustness-optimized support vector machine. Through the framework, we can improve the performance of two classification models in PDF document detection.
We process a total of 6295 benign samples and 5495 malicious samples. The experimental result shows that the EMBORAM and classifier optimization have significantly improved the model's detection performance. The model can detect the common vulnerability attacks comprehensively and can effectively resist anti-detection methods such as data stuffing and imitation attacks. Furthermore, compared to the traditional static detection model, it has higher accuracy.
In future work, we will try to use more types of classification model and combine multiple classification models hierarchically to achieve better detection results.

Conflicts of Interest:
The authors declare no conflict of interest.